End to End data product with SoaM
The purpose of this document is to show how a generic end-to-end data product would look using SoaM and Airflow.
Let’s imagine you need to run a process every day that consists of loading the daily ABT, querying your database, transforming some data points, forecasting over a desired time frame, plotting the results and sharing them with your team via Slack. After that, the results are promoted to the production environment. The diagram below shows this process:
In this case, SoaM and Airflow work side by side to get this running. It’s important to understand the distinction between them.
SoaM will be your internal workflow manager, while Airflow will be your external manager. Airflow is in charge of scheduling all of your tasks through a DAG and retrying them if an issue arises. Meanwhile SoaM, as your internal workflow manager, is in charge of managing the Python logic that carries out the steps mentioned above.
The following sections go into more detail on each of them.
Airflow
We use Apache Airflow to create the DAG that schedules the following pipeline:
Extract the needed data from your raw data DB and load it into your chosen database.
Next, schedule the execution of the pipeline built with SoaM.
Promote the results of the SoaM run to production.
The key here is that Airflow takes care of scheduling on a defined basis (hourly, daily, weekly…), retries and dependencies among tasks, while the SoaM pipeline is only concerned with providing functionality specific to the project.
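The three steps above could be wired into a DAG roughly as follows. This is a minimal sketch assuming Airflow 2.4 or later. The DAG id, the task names, the retry values and the three callables are hypothetical placeholders rather than part of SoaM or of any existing project.

```python
from datetime import datetime, timedelta

# Retry policy applied to every task in the DAG (illustrative values).
DEFAULT_ARGS = {"retries": 2, "retry_delay": timedelta(minutes=5)}


def extract_raw_data():
    """Hypothetical step 1: copy the needed rows from the raw DB to your database."""
    return "extracted"


def run_soam_pipeline():
    """Hypothetical step 2: call the entry point of your SoaM-based pipeline."""
    return "forecasted"


def promote_to_production():
    """Hypothetical step 3: publish the results of the SoaM run."""
    return "promoted"


def build_dag():
    """Wire the three steps into a daily DAG.

    Airflow is imported inside the function so that the plain callables
    above can be imported and tested without an Airflow installation.
    """
    from airflow import DAG
    from airflow.operators.python import PythonOperator

    with DAG(
        dag_id="soam_data_product",
        start_date=datetime(2024, 1, 1),
        schedule="@daily",      # Airflow handles the daily scheduling
        catchup=False,
        default_args=DEFAULT_ARGS,  # and the retries
    ) as dag:
        extract = PythonOperator(
            task_id="extract_raw_data", python_callable=extract_raw_data
        )
        soam = PythonOperator(
            task_id="run_soam_pipeline", python_callable=run_soam_pipeline
        )
        promote = PythonOperator(
            task_id="promote_to_production", python_callable=promote_to_production
        )
        # Dependencies: each step runs only after the previous one succeeds.
        extract >> soam >> promote
    return dag
```

In an actual DAG file you would assign the result of build_dag() to a module-level variable so that the scheduler can discover it. Keeping the callables free of Airflow imports mirrors the split described above: Airflow owns scheduling, retries and ordering, while the project logic stays in ordinary Python.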
SoaM Components at the core of the project’s logic
Once you have your data stored in your database, it’s time for SoaM to come into play.
Firstly, TimeSeriesExtractor comes into action by querying the data needed from your database and returning a ready-to-use Pandas DataFrame.
Then, once you have your DataFrame, it’s time for the Transformer. With the toolkit provided by this module you can apply any scikit-learn transformation or create a custom one for your specific use case.
Thirdly, once the data is fully cleaned, Forecaster lets you apply different machine learning algorithms to your data, such as FBProphet or Orbit, or a custom one authored by the user, to forecast your time series over a desired time frame.
Next, it’s time to plot and see the results! This is where ForecastPlotter comes in, generating a plot that shows your past data together with the forecast produced in the previous step.
Finally, the Reporting module provides tools to generate and share reports with your team or friends via Google Sheets, email, PDF and/or Slack.
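To make the flow of data between these stages concrete, here is a small standard-library sketch of the same extract, transform, forecast and report sequence. It deliberately does not use SoaM’s API: every function, the daily_abt table and the naive moving-average forecast are hypothetical stand-ins chosen only to show how each stage feeds the next.

```python
import sqlite3
from statistics import mean


def extract(conn):
    """Stand-in for the extraction stage: query (date, value) rows in date order."""
    rows = conn.execute("SELECT ds, y FROM daily_abt ORDER BY ds").fetchall()
    return [{"ds": ds, "y": y} for ds, y in rows]


def transform(rows):
    """Stand-in for the transformation stage: drop rows with a missing value."""
    return [r for r in rows if r["y"] is not None]


def forecast(rows, horizon, window=3):
    """Stand-in for the forecasting stage: repeat the mean of the last `window` points."""
    level = mean(r["y"] for r in rows[-window:])
    return [level] * horizon


def report(rows, predictions):
    """Stand-in for the reporting stage: build a one-line text summary."""
    formatted = ", ".join(f"{p:.2f}" for p in predictions)
    return f"{len(rows)} observations, {len(predictions)}-step forecast: {formatted}"


def run_pipeline(conn, horizon=2):
    """Chain the stages: each one consumes the output of the previous one."""
    rows = transform(extract(conn))
    predictions = forecast(rows, horizon)
    return report(rows, predictions)


# Toy in-memory database standing in for "your database" (made-up values).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE daily_abt (ds TEXT, y REAL)")
conn.executemany(
    "INSERT INTO daily_abt VALUES (?, ?)",
    [
        ("2024-01-01", 10.0),
        ("2024-01-02", None),  # missing value removed by transform()
        ("2024-01-03", 20.0),
        ("2024-01-04", 30.0),
        ("2024-01-05", 40.0),
    ],
)
print(run_pipeline(conn))
```

In a real project each stand-in would be replaced by the corresponding SoaM component, and the plotting stage would sit between forecasting and reporting.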
To see how some of this can be easily implemented, check our quickstart!